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Abstract 

Background: Differentiation of primordial germ cells into mature spermatozoa proceeds through multiple stages, 
one of the most important of which is meiosis. Meiotic recombination is in turn a key part of meiosis. To achieve 
the highly specialized and diverse functions necessary for the successful completion of meiosis and the generation 
of spermatozoa thousands of genes are coordinately regulated through spermatogenesis. A complete and unbiased 
characterization of the transcriptome dynamics of spermatogenesis is, however, still lacking. 

Results: In order to characterize gene expression during spermatogenesis we sequenced eight mRNA samples 
from testes of juvenile mice from 6 to 38 days post partum. Using gene expression clustering we defined over 
1,000 novel meiotically-expressed genes. We also developed a computational de-convolution approach and used it 
to estimate cell type-specific gene expression in pre-meiotic, meiotic and post-meiotic cells. In addition, we 
detected 13,000 novel alternative splicing events around 40% of which preserve an open reading frame, and found 
experimental support for 159 computational gene predictions. A comparison of RNA polymerase II (Pol II) ChlP-Seq 
signals with RNA-Seq coverage shows that gene expression correlates well with Pol II signals, both at promoters 
and along the gene body. However, we observe numerous instances of non-canonical promoter usage, as well as 
intergenic Pol II peaks that potentially delineate unannotated promoters, enhancers or small RNA clusters. 

Conclusions: Here we provide a comprehensive analysis of gene expression throughout mouse meiosis and 
spermatogenesis. Importantly, we find over a thousand of novel meiotic genes and over 5,000 novel potentially 
coding isoforms. These data should be a valuable resource for future studies of meiosis and spermatogenesis in 
mammals. 
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Background 

Spermatogenesis is a complex multistage process invol- 
ving thousands of genes, and it is especially difficult to 
study in mammals [1-3]. In essence it is the process in 
males by which diploid cells give rise to haploid gametes. 
Briefly, germline cells are derived from primordial germ 
cells (PGC), that give rise to primitive spermatogonia 
A (pSGA). Some daughter cells of mitotically dividing 
pSGAs differentiate into more advanced spermatogonia 
A subtypes, which eventually give rise to spermatogonia 
B and proceed through the stages of meiotic spermato- 
cytes into spermatids and mature sperm [1,4,5]. Meiosis 
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is central to gametogenesis, and in male mice it starts 
about 8 days post partum (dpp). The outcome of meiosis 
of a single spermatogonium is four haploid spermatids. 
Meiosis lasts about two weeks in mice and consists of 
two divisions, I and II. The primary spermatocytes re- 
plicate their maternal and paternal chromosomes and 
then, in meiosis I, undergo a unique process in which 
the homologous parental chromosomes recombine with 
each other. Recombination generates genetic diversity 
and ensures the proper segregation of chromosomes. Er- 
rors in recombination can lead to either too few or too 
many chromosomes in the spermatids, a phenomenon 
referred to as aneuploidy. The secondary spermatocytes 
divide further in meiosis II which is comparatively very 
brief, a matter of hours, and is similar to a mitotic divi- 
sion. Meiosis II yields haploid spermatids which proceed 
through spermiogenesis resulting in sperm (Figure 1). 
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Figure 1 A time course of spermatogenesis. Primordial germ cells (PGC's) are the germline progenitors. They give rise to various types of 
spermatogonia, which proliferate and differentiate into spermatocytes. Meiosis and genetic recombination occur in spermatocytes, which 
become haploid spermatids after the completion of meiosis. Spermatids undergo the process of spermiogenesis until their maturity, which 
includes elongation, genome condensation and formation of a flagellum. The approximate timeline of the first wave of spermatogenesis in mice, 
in days after birth (dpp) is indicated. The figure is adopted from [45] by the authors holding the copyright. 



Recombination of homologous chromosomes takes place 
during prophase of meiosis I, which consists of five sta- 
ges: leptotene, zygotene, pachytene, diplotene and dia- 
kinesis, in that order, the longest of which is pachytene. 
In the first wave of spermatogenesis, a relatively syn- 
chronous first cycle of sperm production after birth, 
pachytene begins at around 12-14 dpp. During this wave, 
the proportions of the various cell types in the testes 
change over time, while consecutive waves of spermato- 
genesis gradually become desynchronized. The process 
of germ cell development is directly assisted by various 
somatic cells present in testes. Despite the critical im- 
portance of spermatogenesis, its complex developmental 
program and its accompanying changes in gene expres- 
sion are still not fully understood. 

Two alternative approaches have been used to study 
gene expression during spermatogenesis. One uses cell 
sorting to separate and characterize the various cell types 
[6,7]. Another is based on examining gene expression in 
the different cell populations present throughout the first 
wave of spermatogenesis [8,9]. Both of these sets of studies 
have used microarrays. Until recently, high-throughput 
RNA sequencing approaches have been used to charac- 
terize spermatogenesis in a much more restricted way fo- 
cusing either on a single developmental time point [10,11] 
or on comparisons between pairs of neighboring time 
points [12]. A recent paper [13] utilized RNA-Seq of sor- 
ted cells to study the transcriptome of mouse testes. RNA 
sequencing has several important advantages compared to 
microarrays - better sensitivity, a greater dynamic range 
and the ability to detect every expressed gene or splicing 
variant/isoform, even if previously unknown [14]. 

Here we analyzed gene expression during first wave of 
spermatogenesis in murine pups using RNA-Seq in or- 
der to discover novel genes and isoforms active in sper- 
matogenesis and meiosis. We analyzed testis samples 



from 6 to 38 day old mice with two-day sampling inter- 
vals between 10 and 20 dpp to improve the coverage of 
meiotic samples. We classified gene expression profiles 
and compared our results to previous microarray-based 
studies [6-9]. This comparison allowed us to identify 
genes that were not previously described as meiotically- 
expressed in high- throughput studies. We then devel- 
oped a deconvolution algorithm to computationally 
determine cell type-specific gene expression and esti- 
mated gene expression levels in somatic cells, pre- 
meiotic spermatogonia, spermatocytes and spermatozoa. 
We validated our predictions by comparing them to the 
experimentally derived measurements of mRNA levels in 
cell-sorted samples from whole testis [13]. RNA-seq data 
were further mined to describe alternative splicing pat- 
terns and alternative polyadenylation site usage during 
spermatogenesis. Using our RNA-seq data we further 
evaluated computationally predicted gene models. Finally, 
we measured the genome-wide distribution of RNA Pol II 
(Pol II) at two different time points and compared it with 
gene expression in spermatogenesis. 

Results 

RNA-seq of mouse spermatogenesis 

In order to study gene expression during spermatogen- 
esis, we sequenced mRNA samples from whole testes of 
pre-pubertal mice at 10, 12, 14, 16, 18 and 20 days post 
partum (dpp) which include the meiotic stages of the 
first wave of spermatogenesis. We also analyzed pre- 
meiotic (6 dpp) and adult (38 dpp) samples. We generated 
between 42 and 96 millions of reads for each sample 
(Additional file 1: Table SI). There were between 5.5 and 
9.8 million genomic locations with uniquely aligned 
reads (Additional file 1: Table SI). To estimate the qua- 
lity of our protocol of mRNA preparation and sequen- 
cing, we plotted the average read coverage of all genes 
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relative to the position within the transcript (Additional 
file 1: Figure SI). These distributions are mostly even 
for all samples indicating that the RNA samples were 
subjected to limited degradation. 

Overall, we find that around 85% of protein coding 
genes might be expressed at each time point (that is, 
have at least one mapped read). Such a high proportion 
of expressed genes may be at least in part explained by 
the presence of many cell types in our samples [15-17]. 
A stricter criterion of gene expression is the observation 
of exon splicing: in each sample, we see spliced reads in 
over 50% of protein-coding genes (Additional file 1: 
Table S2). Although the vast majority of genes are ex- 
pressed in our samples, 1,008 protein coding genes were 
not detected in any of the samples. Olfactory receptor 
(OLFR) genes were highly enriched among those non- 
expressed genes (p-value <4T0'^^^). Although some ol- 
factory receptor genes are expressed in various tissues 
including testis, most of the members of the superfamily 
are expressed only in olfactory sensory neurons [18,19]. 
Thus, the observed enrichment of OLFR genes is con- 
sistent with the neuronally-restricted expression pattern 
of the majority of olfactory receptors. 

Temporal expression clusters 

To functionally characterize the global gene expression 
patterns we clustered the gene expression profiles using 
k-means. We only considered genes with a sufficiently 
high expression level in at least one of the 8 samples (>2 
Reads per Kilobase per Million of Reads, RPKM), which 



change their expression by at least 2-fold, and have an 
mRNA transcript length of at least 100 bases. As a first 
approximation, for most genes with varying expression 
their expression either decreases or increases monotoni- 
cally with time (Figure 2). One group of genes (cluster 8) 
is mainly expressed only in the adult sample, indicating 
that these genes turn on late in spermatogenesis. The 
expression of the majority of the clustered genes de- 
creases with age (dpp). However, some genes for which 
expression peaks in the middle of the time course are 
also discernible. To have a finer resolution of gene pro- 
files, we split them into 8 clusters (Figure 2; see Materials 
and Methods and Additional file 2). 

Next, we compared our clustering with three previ- 
ously published papers - Chalmel et al. [6], Shima et al. 
[9] and Schultz et al. [8]. Overall we find a good agree- 
ment - 89% or more genes present in common are clas- 
sified similarly (Additional file 1: Figure S2). To further 
validate our clustering results we looked at several well 
known genes associated with meiosis and their corres- 
ponding clusters in the present study, as well as in the 
three microarray studies discussed above (Table 1; see 
also Additional file 1: Table S3 for an extended list of 
genes assigned with Gene Ontology terms meiosis and 
spermatogenesis). While there is an overall agreement 
for genes interrogated by both platforms, there are some 
potentially misclassified genes in the microarray studies. 
For example, Dmcl is in the post-meiotic (PM) cluster 
of Chalmel et al, while Mndl is in the early expression 
cluster A of Shima et al. In fact, Dmcl and Mndl genes 
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Figure 2 Temporal gene expression clustering. Genes with maximal expression above 2 RPKM, at least two-fold expression change and a 
mature transcript length >100b were clustered. Given these criteria, 12,895 genes were selected for clustering into 8 clusters. The cluster sizes are 
1313, 3227, 2826, 1302, 1087, 900, 895 and 1345 for clusters from 1 through 8, respectively. The heatmap (A) and normalized expression profiles 
of cluster centroids (B) are shown. As clusters 3, 4, 6 and 7 show a rising profile during meiosis, 8 to 20dpp, these clusters were designated as 
meiotic clusters. 
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Table 1 Expression profiles of selected genes associated with meiosis 
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While Dmcl and Mndl are found in our cluster 3 - see Figure 2 - tiiey are eitlier misclassified, or unclassified by the microarray-based study of Shima et al. [9] 
and Chalmel et al. [6]. Note that there were no Affimetrix probes in the regions of the Spatall gene and Prclm9 gene (in [8] and [9]). Genes marked with (^) have 
not been either interrogated or classified in the microarray studies [6,8,9]. 



play important roles during meiotic recombination and 
belong to our intermediate cluster 3 (one of our meiotic 
clusters, see below). In agreement with our clustering, 
immunohistochemical analysis of Dmcl protein found it 
in leptotene-to-zygotene spermatocytes [20]. Another 



example is Prdm9 gene, which has recently attracted 
much attention due to its role in determining meiotic 
recombination [21-23]. There are no probe sets for this 
gene in the Affymetrix microarrays used in [8] and [9], 
and it was not classified in [6], probably due to a lack of 
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a signal Similarly, the recently characterized gene Spata22 
[24], known to be essential for meiotic progression, is also 
absent from the microarrays. 

Henceforth, we identif)^ clusters 3, 4, 6 and 7 (Figure 2 
and Additional file 1: Figure S2) as meiotic clusters, 
given that the expression levels of genes belonging to 
these clusters rise coincident with the appearance of 
various meiotic cell types. We find that in these meiotic 
clusters (Figure 2) there are a total of 5,923 genes. Out 
of these 5,923 genes nearly a quarter (1,555 total/1,048 
protein coding genes) were either not present on mi- 
croarrays or were not differentially expressed in previ- 
ous microarray studies [6,8,9] (Additional file 3). We 
thus refer to these genes as novel meiotic genes. We 
must clarify, however, that together with more than 
200 uncharacterized or poorly characterized genes, 
some genes with solidly established meiotic function 
but that were not characterized by microarrays are on 
this list. These genes include Prdm9, Spata22, Morc4, 
Meil and Mlhl, To avoid any ambiguity we empha- 
size that our "novel meiotic genes" do not have to be 
expressed in testis exclusively and include some genes 
with previously assigned meiotic function that were 
not previously characterized as meiotic in other high 
throughput studies. 



In silico determination of cell type-specific gene 
expression 

Our gene expression data set is temporal - we have 
measurements of gene expression levels in whole mouse 
testis at different ages. Testes consist of somatic and 
pre-meiotic germ cells, meiotic spermatocytes and post- 
meiotic spermatids and each of these cell types contains 
numerous subtypes that have their own characteristic 
gene expression profiles [1,25]. Thus, the observed gene 
expression level in a sample prepared from a total testis 
is a sum of gene expression levels from individual cell 
types. Moreover, during the first wave of spermatogen- 
esis, the proportions of different cell types change dras- 
tically. To better understand functional processes during 
the course of spermatogenesis it would be desirable to 
obtain estimates of cell type-specific gene expression. 
Here we use a computational approach to deconvolve 
temporal gene expression profiles from a mixture of cell 
types into cell-type specific expression profiles (Figure 3). 
A similar approach has been proposed and tested in the 
literature [26-31], although typically with fewer cell types 
and for microarrays. 

We took advantage of the digital nature of RNA-Seq 
data, and developed a weighted least squares optimization 
algorithm that allowed us to estimate gene expression 
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Used 5 cell types: 

A <- primitive type A spermatogonia, Sertoli, Leydig, myoid 
B <- types A and B spermatogonia, pre-leptotene and 

leptotene spermatocytes 
C <- zygotene spermatocytes 
D <- pachytene spermatocytes 
E <- secondary spermatocytes, round and condensing 

spermatids 

Figure 3 Schematics of the deconvolution algorithm to estimate cell type-specific gene expression. We have measured gene expression 
by dpp (S), and have estimates of cell type fractions by dpp from the literature (F). Our goal is to estimate gene expression by cell type (C), as 
well as to re-estimate cell type fractions, or cell type contributions to gene expression (F). The iterative procedure is depicted on the figure, and 
the details are in Materials and Methods. Due to both biological and mathematical considerations, 5 combined cell types were considered in 
our analysis. 
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levels in individual cell types (Materials and Methods). 
Briefly, starting with initial estimates of cell type pro- 
portions, we estimate ceU type-specific gene expression, 
which in turn can be used to iteratively re-estimate cell 
type proportions. The initial estimate of cell type fractions 
is based on previously reported values [32] with some of 
the ceU types grouped together (Figure 3). Based on math- 
ematical, as weU as biological considerations, we chose to 
divide all cells into five cell types (or cell type groups) A 
through E (Materials and Methods). The fraction of non- 
meiotic cells (denoted A) drops significantly from 6 dpp 
to adult mice, while proportions of different germ cell 
populations rise and decay throughout the time course 
(Figure 4). Although there were no zygotene spermato- 
cytes at 10 dpp in our initial estimate, they appear after 10 
iterations, which is consistent with previously published 
experimental data [33]. Similarly, we also found that the 
contribution of spermatids (fraction E) to the expression 
in whole testis is negligible at and before 20 dpp. Similar 
to the clustering of temporal gene expression, we also 
clustered cell type-specific gene expression (Table 1). This 
clustering associates genes with certain cell types, similar 



to the classification of cell-sorted gene expression in [6] 
(Figure 4C and Additional file 4). 

The summary of ceU type-specific gene expression ob- 
tained after 10 iterations can also be represented as a 
heatmap (Figure 4) that shows that most genes are ex- 
pressed in one or two cell types. There are many genes 
that are expressed both in somatic. A, and in one of the 
germ cell types. Also, there is a significant overlap in 
gene expression between pachytene and secondary sper- 
matocytes and spermatids, D and E. This observation, 
which is based on a purely computational analysis of 
temporal expression data, is in agreement with previous 
experimental observations [7,34,35]. 

Furthermore, we have compared our cell type-specific 
gene expression predictions to a recently published stu- 
dy [13] that measured gene expression of sorted Sertoli 
and germ cells using RNA-Seq. There is a good corre- 
lation between the experimental measurements of gene 
expression reported in [13] and our deconvolution esti- 
mates (Additional file 1: Figure S3). We have clustered the 
reported experimental gene expression into 5 clusters, each 
corresponding to one of the cell types considered in [13]. 
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Figure 4 Results of the deconvolution algorithm to estimate cell type-specific gene expression. (Left) Cell type fractions used in the first 
iteration (A; based on Bellve et al. [25]) and obtained after 10 iterations (B). (C) Cell type-expression heatmap of genes selected for deconvolution, 
obtained after 10 iterations after genes were clustered in 5 clusters, corresponding to the cell types considered. Somatic expression (cell type A) 
is more ubiquitous, and cell types D and E share many expressed genes. 
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This produced an overall consistent gene classifi- 
cation between our clusters A through E and the five 
clusters we derived from [13]. Drawn as a heatmap, 
gene expression from [13] resembles the pattern of our 
cell-type expression heatmap (compare Figure 4 and 
Additional file 1: Figure S3). 

Soumillon et al. [13] also define four expression clus- 
ters - cluster 1 has high gene expression in sperma- 
tocytes and spermatids, relative to spermatogonia and 
spermatozoa, and cluster 2 is the opposite; cluster 3 is 
high in spermatogonia and spermatids versus spermato- 
cytes and, to a lesser extent, spermatozoa; cluster 4 is 
low in spermatozoa relative to the three other types. Ser- 
toli cells are not shown in this clustering. A comparison 
of the genes shared in our deconvolution clusters A-E 
and those in these four clusters shows that cluster 1 
mostly corresponds to our clusters D and E, cluster 2 to 
our clusters A and B, and clusters 1 and 2 mostly cor- 
respond to our cluster C. Clusters 3 and 4 share very 
few genes with any of our clusters, because many genes 
in these clusters from [13] have low levels of expres- 
sion. This can also be seen in the comparisons of our 
temporal clusters with those of [13] (Additional file 1: 
Figure S3, bottom). 

Looking at cell type-specific gene expression (Figure 4, 
Additional file 1: Figure S4), we found that more genes 
are expressed in somatic A, while fewer genes are ex- 
pressed in germ cell types (between 4,465 in C and 
7,609 in E as compared to 12,300 in A, out of 14,259 
considered). The median expression is largest in C - 76 
RPKM vs. 15-28 RPKM in other cell types. In addition, 
there are 1294, 133, 107, 163 and 528 genes in cell types 
A, B, C, D and E, respectively, that are classified as ex- 
pressed exclusively in those cell types. 

While our statement that fewer genes are expressed in 
individual germ cell types than in somatic type A seems 
to disagree with the statements in [13] that more genes 
are expressed in germ cell types than in Sertoli cells, 
there are several factors to consider. First, our type A in- 
cludes not only Sertoli cells (Figure 3); second, we only 
considered a subset of genes for the deconvolution pro- 
cedure; third, the deconvolution procedure attempts to 
minimize the number of expressed genes needed to ex- 
plain the observed temporal expression data, and is con- 
servative in that sense; fourth, the experimental cell 
sorting purity is on average around 90% and so gene 
expression from other cell types might be observed - 
hence, cell sorting is a permissive approach. Our finding 
that the median expression is largest in C (zygotene 
spermatocytes) resembles the high per cell RNA count 
in spermatocytes reported in [13]. 

Using the deconvolution results, we can associate 
temporal clusters with specific cell types. While this 
could in principle be done based on the timeline of 



spermatogenesis, here we have a quantification of this 
correspondence. For example, temporal cluster 3 has the 
highest contribution from B (spermatogonia A and B, and 
pre-leptotene and leptotene spermatocytes) while tem- 
poral cluster 6 is dominated by D (pachytene spermato- 
cytes) (see Additional file 1: Figure S5). 

One way to validate our predictions is to look at some 
well-known genes. We found that, overall, these genes 
have been properly classified (Table 1, Additional file 1: 
Table S4). For example, the synaptonemal complex genes 
Sycpl-3 were mostly found expressed in zygotene (C), 
protamines are expressed in spermatids (E), Dmcl was 
found in pre-leptotene and leptotene primary spermato- 
cytes, and many Sertoli cell markers were correctly classi- 
fied in cluster A (e.g., Gdnfy EtvS [36]). Spermatogonia 
genes (e.g., Dazl [36]) were classified correctly as well. 
Also, as cell type A includes primitive type A sperma- 
togonia, the assignment of Cdhl to cluster A is appro- 
priate [36]. Further details are discussed in Materials and 
Methods. 

Alternative splicing 

One of the advantages of mRNA-seq over microarrays is 
that all expressed targets can be assessed. We were 
interested in finding novel exon-exon splice junctions, 
which represent new alternative isoforms of known 
genes. First, we checked our ability to detect known 
splices in our data (Additional file 1: Table S5). We 
found that the majority of known splices can be de- 
tected. Pooling all the experiments together, we de- 
tected -75% of about 2x10^ known splices. The sensitivity 
of splicing detection decreases for individual samples, but 
we still observed at least 50% of known splices in each of 
our individual samples. 

We then considered all annotated exons within each 
gene and constructed a list of all possible splice junc- 
tions (~2 X 10^ junctions; see Materials and Methods). 
Overall, we found support for about 13,000 new junc- 
tions (Additional file 1: Table S6 and Additional file 5). 
Although many of these novel splicing variants are ex- 
pressed at a low level, some were rather abundant. 

We found that 38% (4,619 out of 12,272) of the novel 
intra-isoform exon skips (splice junctions between non- 
neighbor exons, per known isoform annotation) preserve 
an open reading frame (ORE). The extent of ORE preser- 
vation is higher for highly-expressed novel splices, with 
five out of the top six splices - in genes RShdml (single- 
stranded nucleic acid-binding), Atp8b3 (ATPase), Usp34 
(ubiquitin hydrolase), Wtl (transcription factor essential 
in development of urogenital system) and SuptSh (see 
below) - being ORF-preserving. Importantly, these abun- 
dant novel ORF-preserving splices constitute a significant 
fraction of the total gene expression. On average, the max- 
imal proportion of novel isoform expression is above 43%, 
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for splice junctions with at least 10 independent reads. 
One of the particularly abundant novel variants is the 
skipping of a coding exon (chr7:29,101,301-29,101,460) 
in the SuptSh gene with the preservation of the ORF 
(Figure 5). This gene participates in the regulation of 
mRNA processing and transcription elongation by RNA 
polymerase IL Expression of this novel splice junction at 
14 dpp to 38 dpp is on the order of 20 to 60 RPKM, rising 
at later time points, while SuptSh expression as a whole is 
between 60 to 85 RPKM for 6-38 dpp. Hence, this novel 
splice isoform can contribute a significant fraction to the 
total genes expression, especially later in spermatogenesis. 
This suggests that this SuptSh isoform might be important 
for sperm maturation. 

In the novel splices, we asked how many exons are 
skipped, when compared to the known isoforms. Ty- 
pically 85 to 90% of our novel splices involve skipping 
one exon (as reported for other tissues), and about 6-8% 
skip two exons (Additional file 1: Figure S6). While most 
novel splices skip one exon, there are exceptions like the 
skipping of exons 5 through 15 in the spermatogenesis- 
associated gene Spata5. However, the ORF is not pre- 
served. Finally, we detected 740 novel exon-exon junctions 
that could only be formed by splicing exons present 
exclusively in alternative known isoforms of a given 
gene. Out of these 740 junctions, 79 (11%) preserve 



the ORF, suggesting that they represent functional trans- 
cript variants. 

Gene predictions 

Since RNA-seq data are not restricted to annotated genes, 
we looked at regions of high expression outside of any of 
the UCSC knownGene, refGene and xenoRefGene lists. 
Interestingly, for many such regions with a high level of 
expression there was an associated (overlapping) Genscan 
gene prediction. Hence we asked whether we can detect 
meiotic expression of computationally predicted putative 
genes. We compared RNA-Seq reads to those expec- 
ted from gene models of the Ensembl and Genscan 
databases. Our data yielded support for 70 gene mo- 
dels from Ensembl, that don't overlap with annotated 
genes found in the UCSC knownGene, refGene and 
xenoRefGene (non-mouse genes) tables. To ensure a 
high level of specificity, we demanded that we ob- 
serve at least one predicted splice site for the models 
considered which would constitute an additional indica- 
tion of transcription and transcript processing (Additional 
file 6). Similarly, we found support for 97 Genscan 
gene models that do not overlap with any known genes 
(Additional file 7). Most of these models are expressed at 
low levels, with only 41 showing maximal expression 
above 2 RPKM (in comparison, the average expression 
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Figure 5 Gene and isoform predictions. (Top) Novel alternative splicing of transcription elongation factor SuptSh differentially expressed in 
spermatogenesis. We see an ORF-preserving exon skipping event (A) that is frequently detected in samples after 12 dpp (B). (C) Predicted 
transcripts (variant 1 and variant 2 differ only by one retained intron). This prediction is based on our RNA-Seq data in combination with UCSC 
and Ensembl annotations and Genscan computational gene predictions. Only the protein coding part of the predicted transcripts is shown. The 
respective lengths of the predicted proteins are 566aa and 589aa. Both of these variants have a HMG box, which is a DNA-binding domain, 
towards their C-terminus, which is completely missed by the UCSC and Ensembl gene models. Expression at this locus peaks sharply at 12dpp. 
Note that there are a few differences in the boundaries of some exons between the predicted transcripts and the known annotation, which are 
invisible here. 
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level of known genes is around 10 RPKM, and the uni- 
form genome coverage would result in -0.3 RPKM). One 
interesting example is the Genscan model chrX.S9S dem- 
onstrating a meiotic upregulation, peaking at 12 dpp with 
120 RPKM. This model has 16 exons; we see 11 splice 
junctions, 5 of which are splices between neighboring 
exons. Therefore, there is evidence for alternative splicing 
as well. We note that out of the 97 Genscan gene models, 
only 8 overlap the 70 Ensembl models described above. 

Recently, a class of large intervening non-coding RNAs 
(lincRNAs) has been described and studied. They are 
proposed to form ribonucleic-protein complexes acting 
in numerous cellular pathways. In mouse, these lincRNAs 
have been obtained from four cell types not involved in 
spermatogenesis [37-39]. We asked whether we had evi- 
dence for expression of any of these lincRNAs in our data, 
and how it changed over time. Combining lincRNA ge- 
nomic regions and exonic structures reported in [37,38] 
yielded 6,129 gene structures that don't overlap known 
UCSC genes. Given that these gene structures were not 
manually curated, we conservatively restricted our analysis 
to long structures with high and variable expression (total 
exon length > 1000 bases, RPKM > 5, and at least a 5-fold 
change in expression), in order to focus on potentially 
more reliable variants. This filtering yielded 59 structures 
(many partially overlapping), which can be interesting 
candidates for future studies (Additional file 8). 

Since lincRNAs are defined as non-coding transcripts, 
we also analyzed known UCSC genes that are annotated 
as non-coding. There are 6,544 such genes. 1,231 of 
them have maximal RPKM > 2, at least a 2-fold change 
in expression and an mRNA length at least 100 bases; 
254 of these transcripts overlap lincRNAs. With stricter 
thresholds, 228 of the 6,544 non-coding genes have ma- 
ximal RPKM > 5, fold change > 5 and an mRNA length 
at least 1 kb; 50 of them overlap lincRNAs. As we are 
mostly interested in meiosis, gene AK034341 caught our 
attention, as its expression peaks quite sharply at 12 dpp 
with around 30 RPKM. We noticed that this gene has 
two isoforms in the Ensembl and Vega annotations, both 
with coding potential. Moreover, our RNA-Seq data sug- 
gested that none of these annotations accounted for the 
observed coverage toward the 3' end of the transcript. 
Computationally predicted Genscan transcript chr7.07277 
was closer to the observed RNA-Seq read coverage. We 
therefore performed a detailed reconstruction of this 
transcript. The result is mostly the merger of the 5 ' part 
from the Ensembl/ Vega annotations with the 3 ' part from 
the Genscan predictions, with additional corrections to 
exon boundaries in some exons. The resulting two pos- 
sible very similar transcripts are protein coding, and pos- 
sess an HMG-box, a DNA-binding domain, towards their 
3 ' end (Figure 5). Their predicted protein lengths are 566 
aa and 589 aa (Additional file 9). Given that its expression 



profile is highly similar to that of the Prdm9 gene, this 
novel gene is an interesting candidate for further study. 

Polyadenylation 

One of the mechanisms regulating gene expression is al- 
ternative polyadenylation [40]. While our experiments 
were not specifically designed to address this question, 
our data contain multiple reads covering polyadenylation 
sites. Hence, we asked whether we could detect alterna- 
tive polyadenylation during spermatogenesis and to what 
extent. We implemented a mapping strategy based on 
partial mapping of previously unmapped reads, together 
with observation of a polyadenylation tail and a polyade- 
nylation signal (see Materials and Methods). Mapping 
reads to the whole genome, we identify 5,229 candidate 
polyadenylation sites (if we demand zero mismatches; 
if we allow for <1 mismatch, the corresponding num- 
ber is 6,801 - see Figure 6 and Materials and Methods). 
3,623 (4,269) candidate sites lie within known gene tran- 
scripts ±10 bases, and of those 2,222 (2,365) are at 10 
bases or less from the closest known 3 ' transcription end. 
638 (676) candidate sites are right at the known 3' tran- 
script end (Figure 6). There are over 20,000 genes, many 
of which have multiple isoforms/transcripts, some of 
which have distinct 3' ends. As there are 49,409 known 
transcripts we detect only -10% or less of known 3' ends. 




Figure 6 Detection of polyadenylation. The proportion of the 
polyadenylation sites perfectly matching the annotated 3' ends of 
known genes is shown in the center of the diagram. Subsequent 
rings going outwards show the percentage of tentative 
polyadenylation sites successively further away from the known 
3' ends. The percentages without brackets are based on the 
total of 5,229 sites found with zero mismatches, while bracketed 
percentage values correspond to the total of 6,801 sites allowing 
up to one mismatch - see text for details. 

V ) 
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In summary, out of 5,229 candidate sites, 2,222 are 
within 10 bases from known sites, while 3,007 sites are 
farther away. About half of the latter are within known 
gene bodies. Such sites are potentially novel sites of al- 
ternative polyadenylation. Extrapolating these numbers 
to all genes, we estimate that there can be over 30,000 
alternative polyadenylation sites active through sperm- 
atogenesis. This finding is consistent with recent work 
highlighting the abundance of alternative polyadenyla- 
tion [40-42]. 

Meiotic sex chromosome inactivation 

Gene expression on the X chromosome is strongly re- 
pressed in pachytene via a process known as meiotic 
sex chromosome inactivation (MSCI) [43-45]. Our data 
show clear evidence of MSCI - gene expression of X- 
linked genes drops dramatically beginning around 16 dpp 
(Figure 7) and there are no X-linked genes in the late 
meiotic clusters 6 and 7 (Additional file 1: Figure S7). 
Similarly, our deconvolution results show only 15 X- 
linked genes with pachytene (D) expression, and in all 
these cases, except one {GmlSOVO, which is a hypothetical 
protein), the estimated expression in D is small compared 
to the expression in some of the other cell types. Post- 
meiotically, however, the X chromosome is no longer 
strongly repressed and many X-linked genes belong to 
temporal cluster 8 and are expressed in cell type E. 

RNA Polymerase II ChlP-Seq 

RNA polymerase II is responsible for transcription of 
pre-mRNAs as well as various non-coding RNAs. To stu- 
dy the correlation of Pol II binding with gene expression 



in spermatogenesis, we performed two Pol II ChlP-Seq ex- 
periments, on testes of 10 and 16dpp mice, and used a 
third, published dataset for adult testis [46]. Typically, we 
see a clear Pol II binding signal at -36% of annotated tran- 
scription start sites (TSSs), and sometimes an increased 
signal along genes (Figure 8). There is a noticeable cor- 
relation between the level of gene expression and the 
strength of the Pol II signals measured at 10 and 16 dpp - 
Spearman correlation is in the range 0.45 - 0.8, and is 
highest at or around the matching time point. Sig- 
nificantly, the correlation is high not only for a Pol II 
signal around gene TSSs, but also in the gene bodies, 
1 kb and farther downstream from TSSs (Additional 
file 1: Figure S8). 

We detected accumulation of Pol II at -10,000 cano- 
nical transcription start sites (TSSs) and -600 genes use 
known alternative TSS s instead. Of the remaining genes, 
100-200 have strong Pol II peaks within the gene body. 
Overall, -1,000 strong Pol II peaks lie within gene 
bodies (more than 1 kb away from TSS), and there is 
a similar number of strong peaks farther than 1 kb 
from annotated gene bodies. Using more comprehen- 
sive sets of predicted Pol II promoters and TSS loca- 
tions ([47,48] and a table of mRNAs from the UCSC 
browser (all_mrna)), we observed an increased overlap 
with our Pol II signals (Additional file 1: Table S7). 
We also observed Pol II in CpG islands and overlap- 
ping histone 3 lysine 4 trimethylation (H3K4me3) marks 
[49], associated with active promoters (Additional file 1: 
Table S7). 

We found that the majority of active gene promoters 
(TSS s) have an accumulation of Pol II at all three time 



km; 


14 -| 


□_ 




DC 


12 - 


c 




o 


10- 


"w 




w 




o 


8- 


Q. 




X 




CD 


6- 


CD 




C 




CD 


4- 


O) 




CD 




O) 


2 - 


CC 




vet 


0- 


< 





-O- Autosomes -•-chrX 




10 12 14 16 18 20 38 
Age (dpp) 





Expression 
High 



Low 



A B C D E 
Deconvolution cell type 



6 10 12 14 16 18 20 38 
Age (dpp) 

Figure 7 The X chromosome is transcriptionally silenced in spermatogenesis as a result of Meiotic Sex Chromosome Inactivation. 

(A) The average expression of X-linked genes decreases after 12dpp, and eventually drops by a factor of ~3 in adult testes. (B), (C) Heatmaps of 
X-linked genes. Only genes selected for temporal clustering are shown (B). The drop in gene expression, relative to genome-wide expression, is 
evident starting at around 16dpp, the pachytene stage. In the adult testes, at 38dpp, many genes that have not been active during the earlier 
stages of the first wave of spermatogenesis are turned on. This is also evident from the deconvolution calculations (C). 
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Figure 8 Example of mRNA-Seq and Pol II ChlP-Seq coverage In a 164 kb genomic window on chromosome 8 (chr8:97,939, 
000-98,103,500). mRNA expression is liigli in blue and absent in grey. Tliere is Pol II accumulation at promoters of genes, and an elevated Pol I 
signal can be seen along gene bodies as well. Here, some genes are expressed throughout the whole time course, while others are initially 
virtually silent. Consistent with these expression profiles, the presence of Pol II at corresponding promoters occurs at different time points. 



points (10 and 16 dpp, and adult) (Additional file 1: 
Figure S9). About half of the 19,858 genes with unam- 
biguously assigned TSSs have at least one of their pos- 
sibly multiple TSSs occupied by Pol II at at least one 
time point. Dissecting the genes by their expression clus- 
ter, we saw the expected picture of a higher lOdpp Pol II 
promoter occupancy in early expression clusters and a 
higher adult Pol II promoter occupancy in late expression 
clusters (Additional file 1: Figure SIO). We also considered 
genes with roughly constant expression (maximal RPKM 
above 2 and fold change less than 2, denoted as const' 
cluster), and genes with low expression (maximal RPKM 
below 2 in all the samples, denoted as low' cluster). 88% 
of genes with roughly constant expression (const' cluster) 
have their promoters occupied in at least one of the three 
time points considered, and 82% in all three time points. 
In contrast, only 7% of weakly or non-expressed genes 
(low' cluster) have Pol II signals at their promoters at 
some point, and these signals are weak. 

To address whether there is promoter occupancy by 
Pol II but no or low expression, we focused on late ex- 
pression clusters 7 and 8, and looked for genes with a 
Pol II signal at lOdpp. This yielded 106 out of 1179 genes 
for cluster 8 and 230 out of 711 genes for cluster 7. We 
also considered genes from cluster 8 with a Pol II signal at 
16dpp - 154 out of 1179. Using DAVID [50,51], we found 
that these lists of genes are enriched in annotations like 
cytoskeleton, intracellular non-membrane-bounded or- 
ganelle, microtubule, cell projection, cilium and related 
terms, with annotation clusters enrichment scores bet- 
ween 1.53 and 4.39. We note that these annotation terms 
are enriched even when the whole corresponding expres- 
sion cluster is used as a background (instead of the whole 
set of mouse genes). Thus it seems lil<ely that genes re- 
sponsible for sperm motility have significant accumulation 
of Pol II around their TSS's early on, but are expressed at 
relatively low levels until later. This observation is similar 



to the situation of a poised Pol II, which is quite ubiqui- 
tous during development [52,53] . 

We also analyzed cases of gene expression without 
clear Pol II at annotated TSS's. We considered expres- 
sion clusters const' and 8 (Additional file 1: Figure SIO) 
as one would expect to see Pol II at TSS's of all const' 
genes, and cluster 8 has the least percentage of genes 
with Pol II at TSS's. For the 'const' expression cluster 
most of the genes in question have in fact nearby Pol II 
at one of their annotated TSS's, which have been omit- 
ted due to either the ambiguity of assigning a TSS inter- 
val to a gene (e.g., in cases of bidirectional promoters), 
or because short gene transcripts were excluded; in 
some cases, the gene expression, or the mappability of 
the TSS region is low; in yet other cases there seems to 
be an incorrect annotation, for example defining part of 
one transcript as a separate transcript with its own TSS, 
leading to disagreements between the RefSeq and UCSC 
annotations. For expression cluster 8, in addition to the 
above, there are cases of overlapping transcripts on op- 
posite strands; sometimes Pol II signals are too weak to 
be called (genes with statistically significant Pol II signal 
near their TSS's have a much higher maximal RPKM: 
median 51 vs. 10 for those without such signal). Another 
factor is that the adult Pol II ChIP data [46] used by us 
seems to correlate better with gene expression at 18 and 
20dpp than with 38dpp. Certainly, some genes use un- 
annotated alternative TSS's, either within or outside of 
the annotated gene transcripts. One possible example 
is the Texl6 gene whose putative un-annotated TSS 
lies -200 kb upstream of its annotated 5 ' end (Additional 
file 1: Figure Sll). 

Discussion 

Identification of meiotically expressed genes 

To define genes expressed in meiosis we sequenced the 
transcriptome of murine testes at eight time points during 
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the first wave of spermatogenesis. We then applied two 
approaches for the identification of genes expressed 
during meiosis: clustering of temporal gene-expression 
profiles and computational deconvolution of temporal 
expression patterns into cell-type specific expression 
profiles. In four out of eight temporal expression clusters 
median expression rose between 10 and 20 dpp - a time 
period that corresponds to the meiotic stages of spermato- 
genesis. We thus designated genes from these clusters as 
meiotically expressed. A comparison of this gene list with 
previous publications allowed us to define 1,048 novel 
meiotic genes which were not previously analyzed by 
microarray studies (Additional file 3). As expected, our 
novel meiotic genes are significantly enriched in such 
meiosis-related Gene Ontology categories as microtubule- 
based movement and response to DNA damage stimulus, 
among others (Additional file 1: Figure SI 2). 

Our experimental measurements of gene expression 
levels constitute bulk expression estimates from hetero- 
geneous samples containing various proportions of germ 
and somatic cells. To estimate gene expression levels in 
individual cell types we developed a computational decon- 
volution approach (Figure 3). Using this deconvolution 
procedure, we found 375 genes (262 protein-coding) that 
are exclusively expressed in meiotic cell fractions C and D. 
Out of them 205 (111 protein-coding) are novel. When 
we compare our temporal clustering classification with 
the deconvolution predictions, we find that 920 out of 
1,048 novel meiotic genes identified via clustering of the 
temporal gene expression profiles are highly expressed in 
the cell types that contain meiotic cells, B, C and D (with 
567 in C and D). Thus, both of our analytic approaches 
define a consistent set of novel meiotic genes. 

Although the vast majority of annotated gene models 
in mouse have experimental support, there are numer- 
ous gene models predicted only using computer tools. 
Here we find direct experimental support for 159 compu- 
tationally predicted transcripts that don't overlap known 
genes. Also, based on partially correct computational mo- 
dels in the locus corresponding to the known non-coding 
gene AK034341, we predict a protein-coding putative gene 
of 566/589 aa with a DNA binding HMG box domain 
(Figure 5). The expression profile of this gene shows 
a sharp peak of transcription in meiosis, between 10 
and 14 dpp. 

Meiotic sex chromosome inactivation 

It is clear that MSCI suppresses X-linked gene expres- 
sion in male meiosis (see above and [34,45,54]). Previ- 
ously it has been suggested that MSCI persists beyond 
meiosis into spermiogenesis [34]. Our temporal profiling 
results and in silico deconvolution allows us to estimate 
post-meiotic gene expression of X-linked genes and test 
the hypothesis that the X chromosome is still inactivated 



past meiosis. It has been reported [34,54] that 13% to 
18% of X-linked genes are expressed after the comple- 
tion of meiosis in mouse. In our analysis, only 27% of 
X-linked genes are expressed in the post-meiotic decon- 
volution cell type E, compared to 46-59% of autosomal 
genes. In contrast, 85% of X-linked and 84-88% auto- 
somal genes are expressed in the mostly somatic cell 
type A. Moreover, 12.4% of X-linked genes are expressed 
mostly in the post-meiotic deconvolution cell type E 
(Figure 4), a proportion that is lower than that for auto- 
somal genes (13.3-17.5%). Thus, even though MSCI is 
relaxed after the completion of meiosis as suggested be- 
fore [34], the X chromosome is transcriptionally sup- 
pressed in post-meiotic cells. This is consistent with 
recent work [13]. 

Dynamics of RNA polymerase II binding patterns and 
their relation to gene expression 

We supplemented our RNA-Seq data with RNA poly- 
merase II ChlP-Seq. While most promoters of expressed 
genes have a discernible Pol II ChlP-Seq signal, thou- 
sands of locations far from annotated gene promoters 
display accumulation of Pol II. These locations can mark 
novel promoters or enhancers, or clusters of small RNAs. 
Indeed, comparison of the Pol II signals at lOdpp and 
16dpp clearly identified activation of pachytene piRNAs 
after lOdpp (Additional file 1: Figure S13). Consistent with 
previous findings [55-57], we found an enrichment of Pol 
II in our 16dpp (pachytene) sample in 80 out of 94 piRNA 
clusters. 

Previous studies have shown that in development, gene 
promoters often harbor a poised Pol II at genes that 
remain suppressed until a certain developmental stage 
[52,53]. Because spermatogenesis is also a developmental 
process, we hypothesized that a similar phenomenon 
might be happening here as well. While our experi- 
mental procedures (ChlP-Seq) cannot determine if the 
observed Pol II is transcriptionally engaged [53], we 
find that genes likely to be associated with sperm mo- 
tility have a Pol II signal around their TSSs as early 
as at lOdpp. This suggests that promoter poising is com- 
mon in spermatogenesis. 

Conclusions 

Here we created a comprehensive reference dataset of 
gene expression in mouse spermatogenesis. We analyzed 
expression in males aged from 8 dpp to 38 dpp effect- 
ively covering all stages of spermatogenesis from pre- 
meiotic cells to spermatozoa. We then computationally 
detected genes expressed pre-meiotically, early and late 
in meiosis and in sperm. Since previous studies of gene 
expression during spermatogenesis in mouse were pri- 
marily performed using microarrays, only a subset of the 
genes was analyzed. Unlike microarrays, our data cover 
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essentially all mouse genes. An additional advantage of 
RNA-Seq compared to microarrays is superior sensitiv- 
ity. These two factors allowed us to define over a 1,000 
novel meiotically-expressed genes. Furthermore, our 
data show that alternatively spliced isoforms are abun- 
dantly represented through spermatogenesis. In total, we 
found more than 13,000 of novel alternative splices some 
of which are highly expressed in meiosis. To further en- 
hance our dataset and better define promoters we sup- 
plemented transcription profiling with RNA polymerase 
II ChlP-Seq at three time points. This allowed us to de- 
scribe temporal dynamics of Pol II behavior in sperm- 
atogenesis and detect numerous regions enriched for Pol 
II away from annotated genes. Taken together, we be- 
lieve that our integrated dataset is a valuable resource 
for further studies of gametogenesis and meiosis in 
vertebrates. 

Methods 

Sample preparation 

For mRNA-Seq, whole testes mRNA extracts from mur- 
ine pups [58] aged 6, 10, 12, 14, 16, 18, 20 and 38 days 
post partum (dpp) were obtained. Total RNA was puri- 
fied using a Total RNA Isolation Mini Kit (Agilent) ac- 
cording to the manufacturer s instructions. 20 [ig of total 
RNA was loaded on 100 [A oligo(dT) Dynabeads (Dynal) 
and polyA + fraction was eluted in 15 [A of DEPC- 
treated water. 0.5 [ig aliquots of polyA + RNA were frag- 
mented in total volume of 10 [A of RNA fragmentation 
buffer (40 mM Tris-Acetate, pH 8.1, 100 mM KOAc, 
30 mM MgOA) for 2 min at 95°C. Samples were diluted 
to 100 [A using DEPC treated H2O and immediately sub- 
jected to ultrafiltration on Amicon YM-30 filters. First 
and second strand cDNA synthesis was performed using 
a Superscript double-stranded cDNA synthesis kit (Invi- 
trogen) using 150 ng of fragmented RNA and 500 ng of 
random hexamers (Promega) as recommended by the 
manufacturer. The standard Illumina protocol was used 
to construct the sequencing library. 

RNA polymerase II (Pol II) ChIP procedure follows 
the standard protocol from manufacturer. Whole testes 
extracts were collected from 10 and 16 dpp pups, and 
immediately cross-linked with 1% paraformaldehyde in 
1 X PBS for 10 minutes at room temperature. Cross- 
linked cells were washed with pre-lysis buffer A (0.25% 
Triton X-100, 10 mM EDTA, 0.5 mM EGTA, 10 mM 
Tris-HCl pH 8) and pre-lysis buffer B (0.2 M NaCl, 
1 mM EDTA, 0.5 mM EGTA, 10 mM Tris-HCl pH 8). 
Cells were pelleted and sonicated in SDS lysis buffer (1% 
SDS, 10 mM EDTA, 50 mM Tris-HCl pH 8 and Ix Pro- 
tease Inhibitor Cocktail (Roche) for 16 cycles at 15 sec 
on and 15 sec off in a sonicator water bath (Bioruptor, 
Diagenode) at 4°C. After that fragmented chromatin sam- 
ples were dialyzed with ChIP dilution buffer (0.01% SDS, 



1.1% Triton X-100, 1.2 mM EDTA, 16.7 mM Tris-HCl, 
167 mM NaCl), samples were incubated with antibody- 
coated magnetic beads (Dynabeads M-280, Invitrogen) 
overnight at 4°C. The antibodies used were rabbit po- 
lyclonal anti-RNA polymerase II CTD repeat YSPTSPS 
(phospho S5) (Abeam ab5131) for the 16 dpp sample, and 
rabbit polyclonal anti-RNA polymerase II (N-20) (Santa 
Cruz sc-899) for the 10 dpp sample. Magnetic beads were 
washed for 5 min once each in low-salt immune complex 
wash buffer (0.1% SDS, Triton X-100, 2 mM EDTA, 
20 mM Tris-HCl pH 8, 150 mM NaCl), high-salt immune 
complex wash buffer (0.1% SDS, Triton X-100, 2 mM 
EDTA, 20 mM Tris-HCl pH 8, 500 mM NaCl), LiCl im- 
mune complex wash buffer (250 mM LiCl, 1% NP-40, 1% 
deoxycholic acid (sodium salt), 1 mM EDTA, 10 mM Tris 
pH 8.0) and then twice in Ix TE for 5 min at 4°C. Immu- 
noprecipitated chromatin samples were eluted in elution 
buffer (1% SDS, 100 mM NaHCOs) for 15 min at 65°C 
and reverse cross-linked overnight at 65°C with 200 mM 
NaCl. Proteinase K treated reverse cross-linked DNA was 
purified with a MinElute PCR purification kit (Qiagen) 
and then the DNA concentration was measured using a 
Qubit dsDNA HS assay kit (Invitrogen). 

Data acquisition 

The sequencing was done on an Illumina Genome 
Analyzer. The summary of reads for mRNA-Seq is given 
in Additional file 1: Table SI. We acquired 36 bases long, 
single end reads. The first and the last two bases in each 
mRNA-Seq read were discarded, as having a high error 
rate relative to the rest. The first two bases are imprecise 
due to random primer mismatches, while the last two are 
due to the limitations in the instrument and reagents 
used, leading to higher error rates. Similar procedures 
were used to align Pol II ChlP-seq data, although without 
discarding the first and last bases. The data is available 
through the NCBIs Gene Expression Omnibus using the 
GEO Series accession number GSE44346. 

Data analysis 

All annotations were downloaded from the UCSC 
genome browser database. Specifically, we used the Mus 
musculus mm9 genome version, with gene annotation 
version 4 tables for the gene model definitions and for 
the selection of one isoform per gene, respectively. Other 
tables were used as well, as described in the text. 

Reads were aligned using ELAND short read alignment 
software for the Illumina Genome Analyzer (version 1 was 
used for mRNA-Seq reads and version 2 for Pol II ChlP- 
Seq). Only the uniquely aligned reads passing the quality 
filtering were selected for calculations reported here. Post- 
alignment analysis was done using custom scripts written 
in Perl and R languages, in conjunction with local MySQL 
database server. We calculated expression levels of all 
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known genes for all 8 samples using the RPKM (Reads 
Per Kilobase per Million of reads) measure of gene ex- 
pression [59]. 

For Pol II ChlP-Seq, we used three data sets, at 10 dpp, 
at 16dpp and at 8 weeks (adult mice). The adult set was 
downloaded from GEO, accession numbers GSE36027 
and GSE29184 [46]. Pol II ChlP-Seq data was analyzed 
using MACS (version 2.0.9) [60], bedtools (version 2.15.0) 
[61], CCAT (version 3.0) [62], and various custom scripts 
in Perl and R. To calculate the correlation of gene expres- 
sion with Pol II signals around gene TSS s and along gene 
bodies, background-corrected read counts within 250 bp 
of annotated TSSs, and along gene transcript excluding 
5 '-most 1 kb were considered, respectively. Short (<1 kb) 
gene transcripts were discarded. The correction for a 
non-specific signal (background or noise) was done by es- 
timation of the noise fraction with CCAT, as well as by 
comparison of ChIP and input signals in regions distant 
from annotated genes, and also by maximization of correl- 
ation by varying the noise fraction. All approaches pro- 
duced similar estimates. Given the noise fraction, the 
specific signal at a locus was estimated as the raw ChIP 
signal from which the estimated non-specific part is sub- 
tracted, given that there is a statistically significant differ- 
ence between raw signal and input (binomial test, false 
discovery rate 0.05). If the significance threshold was not 
reached the specific signal was set to zero. 

For the presence of Pol II peaks near TSSs, we 
checked for overlap of MACS -derived peaks with a TSS 
interval file containing all annotated TSS -centered 
500 bp intervals with overlapping intervals merged, hav- 
ing excluded TSSs derived from short gene transcripts 
(unprocessed transcript length <1 kb). This yields 27,775 
TSS-proximal intervals. 19,858 genes could be unam- 
biguously assigned to these intervals; the ambiguous 
cases with divergent transcription were excluded. Al- 
though the total number of Pol II peaks found in the 
three samples (10 dpp - 20,882, 16 dpp - 68,347 and 
adult - 15,790), varies significantly, and the antibodies 
used are different, the fact that there are between 9,000 
to 11,000 TSS-proximal peaks in all these samples serves 
as the basis for their use and comparison. 

Clustering 

To cluster the temporal gene expression profiles we used 
the k-means algorithm as well as the HOPACH algo- 
rithm [63], as they are implemented in the R statistical 
software (www.R-project.org). There are at least a few is- 
sues to consider in clustering. One is how to normalize 
gene expression. In this paper, each gene expression is 
normalized so that the sum of squares for all points 
equals 1. The advantage is that zero expression remains 
zero. Together with the Euclidean distance, this approach 
is basically equivalent to using the "cosangle" distance 



measure between the original gene expression profiles of 
two genes [63,64]. 

Another issue to consider is how many clusters to 
choose. This of course depends on how detailed we want 
the clustering to be. Application of HOPACH algorithm 
[63] results in 8 clusters for the first level of this hier- 
archical clustering approach. Going further down identi- 
fies more than 400 "main clusters". Our purpose here is 
to get a genome-wide view of the gene expression, and 
therefore we limited the number of clusters to single 
digits. Using k-means with 8 clusters and the same dis- 
tance measure yields very similar results and we use this 
clustering in the paper. 

We also used k-means with a cosangle distance mea- 
sure to cluster gene expression reported in [8] and [13], 
as discussed in the text. 

Deconvolution 

In the deconvolution procedure we assume that gene ex- 
pression of most of the genes in a given cell type is cell- 
autonomous, and does not change much with the age of 
a mouse. If this were not true we could in principle 
argue that we are dealing not with one but with many 
cell types, each having its characteristic gene expression 
profile. The question is therefore reduced to the deter- 
mination of the number of distinct cell types. 

Here, as for the gene expression calculations, we pick 
only one isoform per gene. Utilizing the digital nature of 
our sequencing, we select genes that have at least 100 
reads (hits) within the canonical mRNA transcript, as an 
accuracy cutoff. This yields 14259 genes for the decon- 
volution analysis. Taking the number of reads into ac- 
count allows us to estimate the standard error in this 
number, which we assume to be its square root, follow- 
ing the Poisson model. Hence, we can develop a weigh- 
ted least squares algorithm as described below. 

Let Nij/^ be the number of hits from sample / (out of 
our 8 temporal samples) to gene ; that came from cell 
type k Ni = Y.j,jJS[ijk is the number of hits from sample /, 
and Nij = Z aM)x is the number of hits from sample / to 
gene Let riijk ~ Nij/JNi denote the fraction of hits from 
gene ; and cell type k in sample /. It can be expressed as 
^ijk = ^ikJ^kp where w^^ ~ {lljNij/^)/Ni is the fraction of hits 
from cell type k in sample / and ri/^ ~ {lliNij/^)/(Li,jNij/^) is 
the fraction of hits in cell type k coming from gene y. 
Note that independent estimates of this fraction could 
be imagined, one for each sample /, if we do not sum 
over /; these would vary from sample to sample due to 
fluctuations. In the formulas above we used approximate 
equalities to reflect the fact that on the right-hand side 
are the data-based estimates of these quantities. 

We also assume that there are indeed K cell types, that 
gene expression within each cell type is identical in 
all samples irrespective of the tissue composition, and 
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that there are more samples than cell types (otherwise 
the error terms cannot be estimated), or that these 
assumptions are good approximations. We thus obtain 
Nij = Yl^NiWikUkj + \f^ij^ij where K is the number of 
cell types assumed, and all "normalized" errors are 
assumed to have mean zero and variance 1. Dividing the 
last equation by Ni and transcript length Ly, measured 

in bases, yields Sij = l/^WikCjg -\- ^^^^ -lO^^Siy , where 

Sij = ]^10^ is the RPKM of gene ; in sample /, and 
Ckj = ^ '10^ is the RPKM of gene ; in cell type k - 
exactly what we are looking for. 
The normalized error terms are = 

[Sij-Hk^ikCkj] and we want to minimize Z£| (when 
Nij = 0 in the last formula it is replaced by 1). For given 
cell type contributions Wi^ we take the last sum over /, 
for any gene ; separately. This will yield the estimates for 
Qy. Given these estimates, we can update the by 
summing the sfj over for each sample / separately. This 
procedure of optimizing for C/^j and then for Wi/^ can be 
run iteratively. If the algorithm works, we should expect 
the normalized error terms to have typical values of the 
order of 1. With our data, most noticeable changes oc- 
cur in first two to three iterations, after which the results 
stabilize. 

There are constraints to the allowed values for both 
for Ckj and Wi^: they must be non-negative. Negative 
values are replaced by zero. More precisely, the corre- 
sponding predictor with the most negative coefficient is 
excluded from the list of predictors, and then the mi- 
nimization is performed again with the number of pre- 
dictors reduced by one. If the significance p-value for a 
given predictor is above 0.05 it is also excluded, and 
minimization is repeated. The minimization procedure 
is stopped if either all the remaining predictors have 
positive coefficients with p-values below 0.05 or there is 
only one predictor left (in this case its coefficient is al- 
ways positive). In addition, H kWi^ =1 as we assume we 
include all the cell types present. This normalization is 
enforced after each iteration (for the reported deconvo- 
lution with 5 cell types, we also restrict the cell type 
fractions at 6dpp so that contribution from B is below 
4%, while C, D and E are zero; this leads to only minor 
differences with the unrestricted analysis). The distribu- 
tion of p-values for remaining predictors is shown in 
Additional file 1: Figure S14. 

We are fully aware that due to the stochastic nature of 
our algorithm, namely, due to data-dependent elimin- 
ation of some predictors, the standard p-value calcula- 
tions are incorrect. Still, we use them formally as we are 
not familiar with a sensible alternative suitable for our 



case, except bootstrapping, as discussed below. Addi- 
tionally, our goal here is not to test hypotheses, but to 
find for each gene the cell types that would explain its 
expression profile sufficiently well. We believe that, per- 
haps with some exceptions, our goal is achievable via the 
described procedure. 

A question that arises from the deconvolution analysis 
is the reliability of the predicted cell type-specific gene 
expression. We compare our predictions to microarray 
measurements from a cell-sorted study [6], and find them 
in reasonable agreement (Additional file 1: Figure S15). 
We also find a reasonable agreement between our de- 
convolution predictions and a recent study using RNA- 
seq to measure gene expression in cell-sorted samples 
[13] (Additional file 1: Figure S3). We note that on a 
genome-wide scale, it is doubtful that any approach, in- 
cluding direct cell type-specific expression measurements, 
will yield a completely unambiguous classification. 

To study the stability of our algorithm with respect to 
perturbations to gene expression, we performed a boot- 
strap. Assuming a Poisson distribution of hits for a given 
gene, with unknown rate, and using a Bayesian uniform 
rate prior yields a negative binomial distribution of hits 
with a mean n + 1 and a variance 2(n + 1), where n is the 
measured number of hits (to have a mean n and a vari- 
ance 2n, the prior should be inversely proportional to 
rate; however, in that case, if n = 0 the simulated distri- 
bution will always produce 0 as well). For 102 bootstrap 
runs, starting with initial cell type fractions and with 
perturbed gene expressions and running for ten itera- 
tions, we calculated the presence fractions for each gene, 
which are the fractions of bootstrap runs in which this 
gene is found expressed, for each cell type considered. 
We found that for each cell type there is a well-defined 
bimodal distribution of presence fractions, with maxima 
near 0 and 1 (i.e., genes are either consistently predicted 
not expressed, or expressed, in a given cell type). We de- 
fine a consensus prediction by a threshold of 0.5: if the 
presence fraction is greater than 0.5 the gene is present, 
and vice versa. The difference between the original 
(non-perturbed) and consensus results is small: the ratio 
of genes with different status to those with the same sta- 
tus (either expressed or not) is below 6% for all five cell 
types. If we define genes with a presence fraction of 95% 
or more as confidently expressed, and genes with a pres- 
ence fraction of 5% or less as confidently silent, then the 
ratios of confidently silent to consensus silent are be- 
tween 0.61 and 0.73 for the five cell types, and the ratios 
of confidently expressed to consensus expressed are 
0.92, 0.58, 0.48, 0.71 and 0.73 for A, B, C, D and E, re- 
spectively. These results show that for most of genes the 
algorithm produces stable predictions. 

Based on the estimated cell type-specific expression, 
and on the cell type fractions, it is possible to reconstruct 
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the temporal expression patterns of the genes. We com- 
pare such reconstructed expression patterns after 1 and 
10 iterations with the original, observed expression 
(Additional file 1: Figure S16 and S17). The agreement is 
excellent in both cases, and is better after 10 iterations, es- 
pecially for 38dpp. While such an agreement is not by it- 
self a ftill guarantee of the validity of our approach, it is 
necessary condition for it 

Another useful way to validate our cell type-specific 
gene expression predictions is to see how stable the pre- 
dictions are between different iterations, and between 
different numbers of the cell types considered. We con- 
sider 5, 6 and 7 cell type calculations, at iterations 1, 3 
and 10 and at different p-value cutoffs of 0.05 and 0.001. 
For 5 cell types, Bs (types A and B spermatogonia) and 
Bl (pre-leptotene and leptotene spermatocytes) have 
been combined into one type, B (cell fraction profiles for 
Bs and Bl are quite similar; a similarity for the profiles 
was one of the guides to establish the initial cell type 
groups for our analysis, based on [25,32]). For 7 cell 
types, A has been split into As and Ap, which are Sertoli 
and primitive spermatogonia A, based on [25,32]. 

We find that 5- and 6-cell type calculations agree well 
(Additional file 1: Table S8) - for any given iteration there 
are typically over 11,000 genes (out of 14,259; >77%) with 
a compatible predicted expression (compatibility/consis- 
tency means that if a gene is found expressed (silent) in a 
certain cell type in one calculation, it has to be expressed 
(silent) in the corresponding cell type(s) in the other cal- 
culations considered in the comparison as well). When, in 
addition, we look at compatible expression for different 
iterations (3 and 10) we still find over 50% in agree- 
ment. The comparison of 6- and 7-type is less consistent, 
indicating that 7 cell types are too many for the data 
(Additional file 1: Table S8). 

Additionally, we calculate Pearson and Spearman correl- 
ation coefficients for different cell type-specific gene ex- 
pression vectors (with elements being individual genes) 
between iterations, for 5- and 6-cell type calculations 
(Additional file 1: Table S9). The Pearson correlation is 
dominated by highly expressed genes, while the Spearman 
correlation results are close to the Pearson correlation for 
normalized gene expression (when sum of squares of ex- 
pression RPKM over the cell types is normalized to 1 for 
each gene; note that only genes with non-zero expression 
in at least one of the samples were selected for deconvolu- 
tion, so there is no ambiguity with non-expressed genes). 
The Pearson correlation is above 0.5 for both 5 and 6 cell 
types, while the difficulty in distinguishing Bs and Bl types 
is reflected in the low value of the Spearman coefficient 
especially for Bs in the 6-type calculation. 5- vs. 6-cell type 
cross-table correlations demonstrate similar tendencies. 
Overall, we find these results satisfactorily provide confi- 
dence in our deconvolution algorithm predictions. 



We note that in the absence of cell type fraction esti- 
mates, one could have used either the random guess or 
the non-negative matrix factorization approach [65]. The 
difficulty in such approaches, however, is in the ambigu- 
ity of the association of the biological cell types with the 
estimated profiles. We point out that a subtle distinction 
should in principle be made between the cell type frac- 
tions and cell type contributions to overall gene ex- 
pression, as different cell types could produce different 
amount of mRNA per cell; what we use in our approach 
is cell type contributions, and we assume that cell type 
fractions are similar. 

Mapping of microarray probe sets to UCSC known genes 

We compared our classification results to those in pre- 
viously published studies. These studies utilized micro- 
arrays for gene expression measurements. Schultz et al. 
[8] and Shima et al. [9] used Affymetrix MGU74 A,B,C 
v2 microarrays while Chalmel et al. [6] used MG430 
2.0 microarrays. To find a correspondence between our 
UCSC gene names and the MGU74 probesets we use the 
knownToAffyU74 table from the UCSC genome browser 
database. Chalmel et al. [6] provide Ensembl gene names 
for their probesets, so we could use knownToEnsembl in 
this case to determine the correspondence. 

Analysis of RNA-Seq from cell-sorted samples 

Soumillon et al. [13] performed RNA-Seq of sorted cell 
populations. Gene expression values for the five sorted cell 
types were downloaded from Gene Expression Omnibus, 
accession number GSE43717. Genes were clustered into 
five corresponding clusters using k-means with a cosangle 
distance measure, as described above. Lists of genes in the 
four clusters defined in [13] were obtained from the 
authors. Conversion from Ensembl gene names to gene 
symbols was done with the biomaRt package in R. 

Construction of splices 

Because our data consists of short (36 bp, trimmed to 
32 bp - see above) single-end reads, effective de novo 
splicing discovery is unfeasible. Therefore, we looked for 
potential splicing events between non-neighboring exons 
within each gene, for all genes. In search of novel splices 
we adopted the following strategy. All known isoforms 
of a gene contribute to its set of exons, and each pair 
of non-overlapping non-neighboring exons produced a 
candidate splice. We note that for known splices, only 
neighboring exons of each annotated isoform are consid- 
ered. 28 bp fragments from each exon were merged, in 
order to guarantee at least 4 bases present on either side 
of the splice (as we use 32 bp reads). A handful of cases 
of exons shorter than 28 bp resulted in a splice interval 
shorter than 56 bp. Replicates have been removed. In total, 
over 2 million splice intervals have been constructed, and 
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all the reads have been aligned to the extended genome 
consisting of the chromosome sequences plus the con- 
structed set of splice intervals. If a read did not align to 
the genome or a known splice, but aligned to an alterna- 
tive splice (thus skipping one or more exons in known iso- 
forms, or else forming an inter-known-isoform splice) it 
was considered as a candidate alternative splicing event. 

Polyadenylation 

As in the case of novel alternative splicing, we took into 
account the nature of our short single-end read library. 
To identify polyadenylation start sites, we considered 
reads that did not align to the genome or transcriptome 
as the candidate reads covering such sites. Candidate 
reads were expected to have either the [ACGT]nA36_n 
composition, for the transcript strand, or the T36_n 
[ACGT]n composition, for the opposite strand. The non- 
aligned reads were mapped to the genome by selecting 
sub-reads of varying lengths, skipping bases either at the 
end or at the beginning of the read, and finding the lon- 
gest possible alignment (the [ACGT]n part). Skipping 
bases at the beginning assumes that the read may be 
from a strand complementary to the mRNA and hence 
its 5' start corresponds to the 3' poly- A mRNA tail, 
while skipping bases at the 3' end of the read assumes 
that it is on the same strand as the mRNA. Each candi- 
date position had to have a polyadenylation signal se- 
quence ANTAAA within 50 bases upstream of it (in the 
transcript orientation), and the non-aligning part of the 
read had to be enriched for As (at the end of the read, 
for the transcript strand) or T s (at the beginning of the 
read, for the complementary strand), consisting of >75% 
of these bases. This threshold was lower than 100% to 
allow for sequencing errors in typically very short poly- 
A read stretches. 

Additional files 



Additional file 8: Is a table listing 59 lincRNA regions not 
overlapping UCSC known genes displaying variable and high 
expression through the time course of spermatogenesis. 

Additional file 9: Contains protein sequences of two predicted 
isoforms at genomic locus chr7:1 9684394-1 9699830. 
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